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Abstract 

Popular sparse estimation methods based on 
^i-relaxation, such as the Lasso and the 
Dantzig selector, require the knowledge of 
the variance of the noise in order to prop- 
erly tune the regularization parameter. This 
constitutes a major obstacle in applying these 
methods in several frameworks — such as time 
series, random fields, inverse problems — for 
which the noise is rarely homoscedastic and 
its level is hard to know in advance. In 
this paper, we propose a new approach 
to the joint estimation of the conditional 
mean and the conditional variance in a high- 
dimensional (auto-) regression setting. An 
attractive feature of the proposed estimator 
is that it is efficiently computable even for 
very large scale problems by solving a second- 
order cone program (SOCP). We present the- 
oretical analysis and numerical results assess- 
ing the performance of the proposed proce- 
dure. 



1. Introduction 

Over the last fifteen years, sparse estimation meth- 
ods based on ^i-relaxation, among which the 
Lasso (Tibshirani, 1996) and the Dantzig selector 
(Candes and Tao, 2007) are the most famous exam- 
ples, have become a popular tool for estimating high 
dimensional linear models. So far, their wider use in 



ARNAK. DALALYANiaENSAE.FR 



MOHAMED. HEBIRKaUNIV-MLV.FR 



MEZIANIOCEREMADE . DAUPHINE . FR 



JOSEPH. SALMON@TELECOM-PARISTECH.FR 

several fields of applications {e.g., finance and econo- 
metrics) has been constrained by the difficulty of 
adapting to heteroscedasticity, i.e., when the noise 
level varies across the components of the signal. 

Let T be a finite set of cardinality T. For every t gT 



we observe a sequence {xt,yt) S M x 
yt^h*ixt) + s*{xt)^t 



obeying: 



(1) 



where b* : M'^ ^ K and s*^ : K'' ^ M+ are respec- 
tively the unknown conditional mean and conditional 
variance^ of yt given Xt. Then, the errors £,t satisfy 
E[^t|a;t] = and Var[^t|a;t] = 1. Depending on the 
targeted applications, elements of T may be time in- 
stances (financial engineering) , pixels or voxels (image 
and video processing) or spatial coordinates (astron- 
omy, communication networks). 

In this general formulation, the problem of estimat- 
ing unknown functions b* and s* is ill-posed: the di- 
mensionality of unknowns is too large as compared 
to the number of equations T, therefore, the model 
is unidentifiable. To cope with this issue, the pa- 
rameters (b*,s*) are often constrained to belong to 
low dimensional spaces. For instance, a common as- 
sumption is that for some given dictionary fi , . . . , fp 
of functions from M.'^ to M and for an unknown 
vector (/3*,(T*) G M^ x M, the relations b*(a;) = 
[fi{x),.. .,fp{x)]f3* and s*(a;) = a* hold for every 
X. Even for very large values of p, much larger than 
the sample size T, such a model can be efficiently 
learned in the sparsity scenario using recently intro- 
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^ This formulation of the problem includes "time- 
dependent" mean and variance, i.e., the case of E[j/tja;t] = 
ht{xt) and Var[j/t|a;t] = St{xt), since it is sufficient then 
to consider as explanatory variable [t;xj]^ instead of Xt. 
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duccd scaled versions of ^i-relaxations: the square- 
root Lasso (Antoniadis, 2010; Bclloni et al., 2011; 
Sun and Zhang, 2012; Gauticr and Tsybakov, 2011), 
the scaled Lasso (Stadler ct al., 2010) and the scaled 
Dantzig selector (Dalalyan and Chen, 2012). These 
methods are tailored to the context of a fixed noise 
level across observations (homoscedasticity), which re- 
duces their attractiveness for applications in the afore- 
mentioned fields. In the present work, we propose a 
new method of estimation for model (1) that has the 
appealing properties of requiring neither homoscedas- 
ticity nor any prior knowledge of the noise level. The 
only restriction we impose is that the variance function 
s* is of reduced dimensionality, which in our terms 
means that its inverse 1/s* is of a linear parametric 
form. 



Our contributions and related work We propose 
a principled approach to the problem of joint estima- 
tion of the conditional mean function b* and the condi- 
tional variance s* , which boils down to a second-order 
cone programming (SOCP) problem. We refer to our 
procedure as the Scaled Heteroscedastic Dantzig selec- 
tor (ScHeDs) since it can be seen an extension of the 
Dantzig selector to the case of heteroscedastic noise 
and group sparsity. Note that so far, inference un- 
der group-sparsity pioneered by (Yuan and Lin, 2006; 
Lin and Zhang, 2006), has only focused on the sim- 
ple case of known and constant noise level both in 
the early references (Nardi and Rinaldo, 2008; Bach, 
2008; Chesneau and Hebiri, 2008; Meier et al, 2009), 
and in the more recent ones (Lounici et al., 2011; 
Huang et al., 2012). In this work we provide a theoret- 
ical analysis and some numerical experiments assessing 
the quality of the proposed ScHeDs procedure. 

More recently, regression estimation under the com- 
bination of sparsity and heteroscedasticity was ad- 
dressed by (Dayc ct al., 2012; Wagencr and Dettc, 
2012; Kolar and Sharpnack, 2012). Because of the 
inherent nonconvexity of the penalized (pseudo-)log- 
likelihood considered in these works, the methods pro- 
posed therein do not estimate the conditional mean 
and the variance in a joint manner. They rather rely 
on iterative estimation of those quantities: they alter- 
nate between the two variables, estimating one while 
keeping the other one fixed. Furthermore, the theo- 
retical results of these papers are asymptotic. In con- 
trast, we propose a method that estimates the condi- 
tional mean and the variance by solving a jointly con- 
vex minimization problem and derive nonasymptotic 
risk bounds for the proposed estimators. 



Notation We use boldface letters to denote vec- 
tors and matrices. For an integer d > 0, we set 
[d] = {l,...,d}. li V e R"^ and J C [d], then vj 
denotes the sub-vector of v obtained by removing all 
the coordinates having indexes outside J. If J = {j}, 
we write vj — Vj. The £q-norms of v are defined by: 



^lo = Ej=i H'Vj 7^ 0), |t;|oo = maxjg{i,...,<i} \vj\, 

9 



l«l? = Ej=il«jl^ l<g<oo 



For a matrix A, Ai^-, and A-^j stand respectively for 
its i-th row and its j'-th column. For a vector Y = 



[yi 



,yTV G 



, we define diag(l^) as the T x T 



diagonal matrix having the entries of Y on its main 
diagonal. 

2. Background and assumptions 

We start by reparameterizing the problem as follows: 



r*(a;) = l/s*(a;), r{x) = b*{x)/s*{x). 



(2) 



Clearly, under the condition that s* is bounded away 
from zero, the mapping (s*, b*) H> (r*,f*) is bijective. 
As shown later, learning the pair (r*,f*) appears to 
be more convenient than learning the original mean- 
variance pair, in the sense that it can be performed by 
solving a convex problem. 

We now introduce two assumptions underlying our ap- 
proach. The first one is a group sparsity assumption 
on the underlying function f * . It states that there ex- 
ists a given dictionary of functions fi , . . . , fp from R'' 
to R such that f * is well approximated by a linear com- 
bination X^f^i '^'j^j with a (fixed) group-sparse vector 



4>*^ 



'\^ ■ 



The precise formulation is: 



Assumption ( Al) We denote by X the T x p matrix 
having [fi(a;t), . . . , fp(a;t)] ^^ ^"th row. Then, for a 
given partition Gi, . . . , Gk of {1, . • . ,p}, there is a 
vector ^* e W such that [f*(xi), . . .,^*{xt)Y ~ 
X</>* and Card({fc : Wq^ + 0}) < ^• 

Assumption (Al) is a restriction on f* only; the 
function r* does not appear in its formulation. Let 
us describe two practical situations which fit into 
the framework delineated by Assumption (Al), some 
other examples can be found in (Lounici ct al., 2011; 
Mairal et al, 2011; Huang et al., 2012). 

Sparse linear model with qualitative covariates 

Consider the case of linear regression with a large 
number of covariates, an important portion of which 
are qualitative. Each qualitative covariate having m 
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modalities is then transformed into a group of m bi- 
nary quantitative covariates. Therefore, the irrele- 
vance of one qualitative covariate implies the irrele- 
vance of a group of quantitative covariates, leading to 
the group-sparsity condition. 

Sparse additive model (Ravikumar et al., 2009; 
Koltchinskii and Yuan, 2010; Raskutti et al., 2012) If 
f * is a nonlinear function of a moderately large number 
of quantitative covariates, then — to alleviate the curse 
of dimensionality — a sparse additive model is often 
considered for fitting the response. This means that f * 
is assumed to be of the simple form f]^(a;i)-|-. . .+f^{xd), 
with most functions f* being identically equal to zero. 
Projecting each of these functions onto a fixed number 
of elements of a basis, f*(a;) ~ J2e=i 4>t,j'4'i{x), we get 
a linear formulation in terms of the unknown vector 
4> — {(}>£, j)- The sparsity of the additive representa- 
tion implies the group-sparsity of the vector (p. 

Our second assumption requires that there is a linear 
space of dimension q, much smaller than the sample 
size T, that contains the function r*. More precisely: 

Assumption (A2) For q given functions ri,...,rq 



mappmg 



into 



^+1 



there is a vector a € 



such that r*{x) ~ J2e=i '^i'^ti^) foi' every x £ R . 

Here are two examples of functions r* satisfying this 
assumption. 

Blockwise homoscedastic noise In time series 
modeling, one can assume that the variance of the in- 
novations varies smoothly over time, and, therefore, 
can be well approximated by a piecewise constant func- 
tion. This situation also arises in image processing 
where neighboring pixels are often corrupted by noise 
of similar magnitude. This corresponds to choosing a 
partition of T into q cells and to defining each r^ as 
the indicator function of one cell of the partition. 

Periodic noise-level In meteorology or image pro- 
cessing, observations may be contaminated by a pe- 
riodic noise. In meteorology, this can be caused by 
seasonal variations, whereas in image processing, this 
may occur if the imaging system is subject to elec- 
tronic disturbance of repeating nature. Periodic noise 
can be handled by (A2) stating that r* belongs to the 
linear span of a few trigonometric functions. 

There are essentially three methods in the literature 
providing estimators of (b*,s*) in a context close to 
the one described above. All of them assume that s* is 
constant and equal to a* and [b*(a;i), . . . , b*{xT)] = 
yi(3* with some sparse vector /3* G R^. The first 



method, termed the scaled Lasso (Stadlcr et al., 2010), 
suggests to recover (/3*,(T*) by computing a solution 
^^Sc-L jjSc-L-j ^Q ^j-^g optimization problem 



min(riog(cr)-|- 

/3,(T L 



X/3| 



A 



2a2 



-^|X,,y/3,|}, (3) 



i=i 



where A > is a scale-free tuning parameter control- 
ling the trade-off between data fitting and sparsity 
level. After a change of variables, this can be cast 
as a convex program. Hence, it is possible to find the 
global minimum relatively efficiently even for large p. 

A second method for joint estimation of (3* and 
a* by convex programming, the Square-Root Lasso 
(Antoniadis, 2010; Belloni et al., 2011), estimates /3* 
by ^SqR-L -^jjjcj^ solves 



mm 
/3 



{\Y-X(3\.^ + xJ2\X:.,m\} 



J=l 



X^SqR-] 



(4) 



as an 



and then defines a^qR-L _ -^ly ^^^ ,„ 

estimator of a* . Both in theory and in practice, these 
two methods perform quite similarly (Sun and Zhang, 
2012). 

A third method, termed scaled Dantzig selector, was 
studied by (Dalalyan and Chen, 2012) under a more 
general type of sparsity assumption (called fused or 
indirect sparsity). Inspired by these works, we propose 
a new procedure for joint estimation of the conditional 
mean and the conditional variance in the context of 
heteroscedasticity and group-sparsity. 

3. Definition of the procedure 

Our methodology originates from the penalized log- 
likelihood minimization. Assuming errors S.t are i.i.d. 
Gaussian ^"(0,1) and setting f(a;) — J2^=i^j^ji^)^ 
the penalized log-likelihood used for defining the 
group-Lasso estimator is (up to summands indepen- 
dent of (f, r)): 

PL(f, r) = ^{-log(r(a;,)) + l{rixt)yt ~ Xt,cj>f} 



ter 



K 



VaJV X 



fe=l 



jSGfc 



.3V0 



(5) 



pK 



where A = (Ai, . . . , \k) € R+ is a tuning parameter. 
A first strategy for estimating (f * , r* ) is to minimize 
PL(f , r) with respect to ^ G R^ and r e {g : R"^ ^ 
R : g(a;) > 0, for almost all x e R'^}. In view of 
assumption (A2), we can replace r by X]'=i '^i^i with 
an unknown g- vector a. 
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If we introduce the T x q matrix R having as generic 
entry ri(xt), (5) translates into a convex program with 
respect to the p + q dimensional parameter {(p, a) G 
RP X M"?, in which the cost function is: 



teT 



K 

E 
fc=i 



Afe|^:,Gfc0Gfc 



fcl2' 



(6) 



and the constraint minj Rt^.cx > should be imposed 
to guarantee that the logarithm is well defined. This 
is a convex optimization problem, but it does not 
fit well the framework under which the convergence 
guarantees of the state-of-the-art optimization algo- 
rithms are established. Indeed, it is usually required 
that the smooth components of the cost function have 
Lipschitz-smooth derivative, which is not the case for 
(6) because of the presence of the logarithmic terms. 
One can circumvent this drawback by smoothing these 
terms^, but we opted for another solution that re- 
lies on an argument introduced in (Candes and Tao, 
2007) for justifying the Dantzig selector. Let Tlck = 
X:_G^(X7^Q^,X:_Gfc)^X^Q^ be the orthogonal projector 
onto the range of X:,g^ in M^. 

Definition 3.1. Let A S M.^ be a vector of tuning pa- 
rameters. We call the Scaled Heteroscedastic Dantzig 
selector (ScHeDs) the pair ((p,a), where ((p,a,v) is a 



minimizer w.r.t. ((j),a,v) G 
function 

V 

subject to the constraints 



of the cost 



^K 



Afc|X:,Gfe^, 



Gfcb 



nG,(diag(r)Ra -X0)|2 < A^, Vfc e [K]; (7) 

R^v < R^diag(y)(diag(r)RQ: - X0); (8) 

1/vt < Rt,a; Vt G T. (9) 



Constraints (7)-(9) are obtained as convex relaxations 
of the first-order conditions corresponding to minimiz- 
ing (6). In fact, Eq. (7) is a standard relaxation for 
the condition e d(i,PL{4),a), whereas constraints 
(8) and (9) are convex relaxations of the equation 
daPL{4>, a.) = 0. Further details on this point are 
provided in the supplementary material. At this stage 
and before presenting theoretical guarantees on the 
statistical performance of the ScHeDs, we state a result 
telling us the estimator we introduced is meaningful. 

Theorem 3.2. The ScHeDs is always well defined in 
the sense that the feasible set of the corresponding op- 



^This will result in introducing new parameters the tun- 
ing of which may increase the difficulty of the problem. 



timization problem is not empty: it contains the min- 
imizer of (6). Furthermore, the ScHeDs can be com- 
puted by any SOCP solver. 

The proof is placed in the supplementary material. As 
we will see later, thanks to this theorem, we carried out 
two implementations of the ScHeDs based on an inte- 
rior point algorithm and an optimal first-order proxi- 
mal method. 

4. Comments on the procedure 

Tuning parameters One apparent drawback of the 
ScHeDs is the large number of tuning parameters. For- 
tunately, some theoretical results provided in the sup- 
plementary material suggest to choose \k = \o\/rk, 
where Aq > is a one-dimensional tuning parame- 
ter and Tfe = rank(X:^Gj.). In particular, when all 
the predictors within each group are linearly indepen- 
dent, then one may choose A proportional to the vector 
(v/Card(Gi), . . . , v/Card(GK)). 

Additional constraints In many practical situa- 
tions one can add some additional constraints to the 
aforementioned optimization problem without leaving 
the SOCP framework. For example, if the response 
y is bounded by some known constant Ly, then it is 
natural to look for conditional mean and conditional 
variance bounded respectively by Ly and L^. This 
amounts to introducing the (linearizable) constraints 
|-X't.:^| < LyRt,a and Rt^.oc > 1/Ly for every t G T. 

Bias correction It is well known that the Lasso and 
the Dantzig selector estimate the nonzero coefficients 
of the regression vector with a bias toward zero. It was 
also remarked in (Sun and Zhang, 2010), that the es- 
timator of the noise level provided by the scaled Lasso 
is systematically over-estimating the true noise level. 
Our experiments showed the same shortcomings for 
the ScHeDs. To attenuate these effects, we propose a 
two-step procedure that applies the ScHeDs with the 
penalties X^. = Xo^/rk at the first step and discards 
from X the columns that correspond to vanishing coef- 
ficients of 0. At the second step, the ScHeDs is applied 
with the new matrix X and with A = 0. 

Gaussian assumption Although the proposed al- 
gorithm takes its roots from the log-likelihood of the 
Gaussian regression, it is by no means necessary that 
the noise distribution should be Gaussian. In the 
case of deterministic design Xt, it is sufficient to as- 
sume that the noise distribution is sub-Gaussian. For 
random i.i.d. design, arguments similar to those of 
(Belloni et al., 2011; Gautier and Tsybakov, 2011) can 
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be applied to show oracle inequalities for even more 
general noise distributions. 

Equivariance Given the historical data {yi-.T, Xi:t) 
of the response and the covariates, let us denote by 
VT+iiyi-.T) = [ELi<?jfj(^T+i)]/[ELi"^i'£(a;T+i)] 
the prediction provided by the ScHeDs for a new ob- 
servation xt+1- This prediction is equivariant with 
respect to scale change in the following sense. If all 
the response values yi, . . . ,yT are multiplied by some 
constant c, then it can easily be proved that the new 
prediction can be deduced from the previous one by 
merely muhiplying it by c: yr+iicyi-.r) = cyT+iiyi-.r)- 

Most papers dealing with group-sparsity 
(Lounici et al., 2011; Liu et al., 2010; 

Huang and Zhang, 2010) use penalties of the form 
Ej. |Dfe(/)Q I2 with some diagonal matrices D^. In 
general, this differs from the penalty we use since 



in our case D^ = O^^g ^■■■ 



\l/2 



is not necessarily 



diagonal. Our choice has the advantage of being 
equivariant w.r.t. (invertible) linear transformations 
of predictors within groups. 

Interestingly, this difference in the penalty definition 
has an impact on the calibration of the parameters A^ : 
while the recommended choice is A^ oc Card(G'fe) when 
diagonal matrices'^ D^ are used, it is A^ ex rank(X.^(3^,) 
for the ScHeDs. Thus, the penalty chosen for the 
ScHeDs is slightly smaller than that of the usual group- 
Lasso, which also leads to a tighter risk bound. 

5. Risk bounds 

We present a finite sample risk bound showing that, 
under some assumptions, the risk of our procedure is of 
the same order of magnitude as the risk of a procedure 
based on the complete knowledge of the noise-level. 

Recall that the model introduced in the foregoing sec- 
tions can be rewritten in its matrix form 



diag(y)Ra* = X0* + ^, 



(10) 



with ^1, . . . , ^T i.i.d. zero mean random variables. To 
state the theoretical results providing guarantees on 
the accuracy of the ScHeDs estimator {(p, a), we need 
some notation and assumptions. 

For 4>* £ MP, we define the set of relevant groups /C* 
and the sparsity index s* by 



i^* = {k--\rGj,^o} 



E 



feSK* 



rk, 



(11) 



Even if the matrices Dfc are not diagonal and 
are chosen exactly as in our case, recent references 
like (Simon and Tibshirani, 2012) suggest to use A| oc 
Card(G'fc) without theoretical support. 



Note that these quantities depend on (/>* . To establish 
tight risk bounds, we need the following assumption 
on the Gram matrix X^X, termed Group-Restricted 
Eigenvalues (GRE). 

Assumption GRE(iV, k): For every JC C [p] of car- 
dinality not larger than N and for every <5 G M^ satis- 
fying 



Xl^e^fe|^^G,^G,|2 <Er^'=|-^-^''^<= 



fe|2' 



(12) 



it holds that X^ , > K'^J^keic ^-.Gk^Gk 2' 




We also set 




C ma- ^ V ^«(^*^^'^*)' 


(13) 


1 .r^ r^^ 
C2 = max — > ,„ , 


(14) 


r< • 1 Y^ ^t^ 


(15) 



and define C4 = {VC2 + 

To establish nonasymptotic risk bounds in the het- 
eroscedastic regression model with sparsity assump- 
tion, we first tried to adapt the standard techniques 
(Candcs and Tao, 2007; Bickcl et al., 2009) used in the 
case of known noise-level. The result, presented in 
Theorem 5.1 below, is not satisfactory, since it pro- 
vides a risk bound for estimating (p* that involves the 
risk of estimating a* . Nevertheless, we opted for stat- 
ing this result since it provides guidance for choosing 
the parameters Afc and also because it constitutes an 
important ingredient of the proof of our main result 
stated in Theorem 5.2 below. 

Theorem 5.1. Consider model (10) with determinis- 
tic matrices X and R. Assume that the distribution 
of ^ is Gaussian with zero mean and an identity co- 
variance matrix and that Assumption GRE(K* , k) is 
fulfilled With K* = Card(^*). Let e G (0, 1) be a tol- 
erance level and set 

Xk ^ 2{rk + 2^rklog{K/e) + 2log{K/e)y^\ 

IfT > 8C4log( — ) then, with probability at least 1 — 2e, 



|X(</. - (f>% < CWi8/T)\og{2q/s){2\XcP*\2 + lib) 
+ -^y2s* -^3K*log{K/e) 
+ |diag(Y)R(S-a*)|2. (16) 

In order to gain understanding on the theoretical limits 
delineated by the previous theorem, let us give more 
details on the order of magnitude of the three terms 
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ScHeDs 



Square-root Lasso 



1/3-/3*12 



10|a 



1/3-/3- 



1015 



(T, p, s*, a') 


Ave 


StD 


Ave 


StD 


Ave 


StD 


Ave 


StD 


Ave 


StD 


Ave 


StD 


(100, 100,2,0.5) 


.06 


.03 


.00 


.00 


.29 


.21 


.08 


.05 


.19 


.42 


.31 


.23 


(100, 100,5,0.5) 


.11 


.06 


.00 


.00 


.29 


.31 


.12 


.05 


.16 


.41 


.30 


.24 


(100, 100,2,1.0) 


.13 


.07 


.02 


.14 


.53 


.40 


.16 


.11 


.19 


.44 


.56 


.42 


(100, 100,5,1.0) 


.28 


.24 


.08 


.32 


.76 


.78 


.25 


.13 


.19 


.44 


.66 


.57 


(200, 100,5,0.5) 


.08 


.03 


.00 


.00 


.20 


.16 


.09 


.03 


.20 


.46 


.22 


.16 


(200, 100,5,1.0) 


.15 


.05 


.01 


.09 


.40 


.30 


.17 


.07 


.20 


.44 


.42 


.31 


(200, 500,8,0.5) 


.10 


.03 


.00 


.04 


.23 


.16 


.11 


.03 


.17 


.40 


.24 


.17 


(200, 500,8,1.0) 


.21 


.13 


.02 


.17 


.50 


.58 


.22 


.08 


.19 


.43 


.46 


.38 


(200, 1000, 5, 1.0) 


.15 


.05 


.01 


.08 


.40 


.31 


.17 


.07 


.17 


.40 


.42 


.33 



Table 1. Performance of the (bias corrected) ScHeDs compared with the (bias corrected) Square-root Lasso on a synthetic 
dataset. The average values and the standard deviations of the quantities \/3 — /3*|2, |s^ — s*| and 10|a — a*\ over 500 
trials are reported. They represent respectively the accuracy in estimating the regression vector, the number of relevant 
covariates and the level of noise. 



appearing in (16). First, one should keep in mind that 
the correct normalization of the error consists in divid- 
ing X(0 — <p*)\ by vT. Assuming that the function 
b* is bounded and using standard tail bounds on the 
Xt distribution, we can see that the first term in the 
right-hand side of (16) is negligible w.r.t. the second 
one. Thus if we ignore for a moment the third term. 
Theorem 5.1 tells us that the normalized squared error 
r^^|X(0 — 0*)| of estimating 0* by cj) is of the order 
of s*/T, up to logarithmic terms. This is the (opti- 
mal) fast rate of estimating an s* -sparse signal with T 
observations in linear regression. 

To complete the theoretical analysis, we need a bound 
on the error of estimating the parameter a*. This is 
done in the following theorem. 

Theorem 5.2. Let all the conditions of Theorem 5.1 
be fulfilled. Let q and T he two integers such that 1 < 
q <T and let e G (0, 1/5). Assume that for some con- 



stant Di > 1 the inequality max^gT- -^ 
true and denote I?T,e = ^i(2|X</)*|^ 



Rt. 



< Di holds 



51og(2T/£)). 

Then, on an event of probability at least 1 — 5e, the 
following inequality is true: 



|X(cA-0* 



|2<4(C4- 
, 8DT.e 



K 



|-l)Z?^/,V2glog(2g/e) 
^/2s*+3K*\og{K/e). (17) 



Furthermore, on the same event. 
|R(a* -a) 



Dl^^\Ra* 



< 4(C4 + 2)D^J^^^2q\og{2q/e) 



8D 



T,E 



^/2s* +3K*\og{K/e). (18) 



The first important feature of this result is that it pro- 
vides fast rates of convergence for the ScHeDs esti- 
mator. This compares favorably with the analogous 



result in (Kolar and Sharpnack, 2012), where asymp- 
totic bounds are presented under the stringent condi- 
tion that the local minimum to which the procedure 
converges coincides with the global one. The joint con- 
vexity in and a of our minimization problem allows 
us to avoid such an assumption without any loss in the 
quality of prediction. 

One potential weakness of the risk bounds of Theorem 
5.2 is the presence of the quantity Di, which controls, 
roughly speaking, the loo norm of the vector RS. One 
way to circumvent this drawback is to add the con- 
straint maxt Rt,:Ot < /i* to those presented in (7)-(9), 
for some tuning parameter /i*. In this case, the op- 
timization problem remains an SOCP and in all the 
previous results one can replace the random term Di 
by ^* /fj,^,, where /i* is a lower bound on the elements 
of the vector Ra*. This being said, we hope that 
with more sophisticated arguments one can deduce 
the boundedness of Di by some deterministic constant 
without adding new constraints to the ScHeDs. 

One may also wonder how restrictive the assumptions 
(13)-(15) are and in which kind of contexts they are 
expected to be satisfied. At a heuristic level, one may 
remark that the expressions in (13)-(15) are all em- 
pirical means: for instance, (1/^) X^t '^«/(-^t,:^*)^ = 
(1/T) J2t '^i{'^tY h*{^tY- Assuming that the time se- 
ries {xt\ is stationary or periodic, these empirical 
means will converge to some expectations. There- 
fore, under these types of assumptions, (13)-(15) are 
boundedness assumptions on some integral functionals 
of r*, f* and r^'s. In particular, if r^'s are bounded and 
bounded away from 0, f* is bounded and r* is bounded 
away from zero, then the finiteness of the constant C4 
is straightforward. 

To close this section, let us emphasize that the GRE 
condition is sufficient for getting fast rates for the 
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performance of the ScHeDs measured in prediction 
loss, but is by no means necessary for the consistency. 
In other terms, even if the GRE condition fails, the 
ScHeDs still provides provably accurate estimates that 
converge at a slower rate. This slow rate is, roughly 
speaking, of the order [T^^{s* +K* logK)]^/'^ instead 



of [T- 



K* log K)] 



1/2 



6. Experiments 

To assess the estimation accuracy of our method and 
to compare it with the state-of-the-art alternatives, 
we performed an experiment on a synthetic dataset. 
Then, the prediction ability of the procedure is evalu- 
ated on a real-world dataset containing the tempera- 
tures in Paris over several years. 

6.1. Implementation 

To effectively compute the ScHeDs estimator we 
rely on Theorem 3.2 that reduces the computation 
to solving a second-order cone program. To this 
end, we implemented a primal-dual interior point 
method using the SeDuMi package (Sturm, 1999) 
of Matlab as well as several optimal first-order 
methods (Ncstcrov, 1983; Auslender and TebouUe, 
2006; Beck and Teboulle, 2009) using the TFOCS 
(Becker et al., 2011). We intend to make our code 
publicly available if the paper is accepted. Each of 
these implementations has its strengths and limita- 
tions. The interior point method provides a highly 
accurate solution for moderately large datasets (Fig. 1, 
top), but this accuracy is achieved at the expense of 
increased computational complexity (Fig. 1, bottom). 
Although less accurate, optimal first-order methods 
have cheaper iterations and can deal with very large 
scale datasets (see Table 2). All the experiments were 
conducted on an Intel(R) Xeon(R) CPU ®2.80GHz. 

6.2. Synthetic data 

In order to be able to compare our approach to other 
state-of-the-art algorithms, we place ourselves in a set- 
ting of homoscedastic noise with known ground truth. 
We randomly generate a matrix X e R-^^p with i.i.d. 
standard Gaussian entries and a standard Gaussian 
noise vector ^ S R-^ independent of X. The noise vari- 
ance is defined by at = a* with varying values cr* > 0. 
We set /3° = [Ig-, Op^s*V and define 0* = (3* /a*, 
where f3* is obtained by randomly permuting the en- 
tries of /3°. Finally, we set Y = a*{'Xcj)* + $,). 

Nine different settings depending on the values of 
(T,p, s*, cr*) are considered. In each setting the exper- 
iment is repeated 500 times; the average errors of esti- 
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Figure 1. Comparing implementations of the ScHeDs: in- 
terior point (IP) vs. optimal first-order (OFO) method. We 
used the experiment described in Section 6.2 with T = 200, 
s* = 3, cr = 0.5. Top: square-root of the MSE on the 
nonzero coefficients of /3*. Middle: square-root of the MSE 
on the zero coefficients of /3*. Bottom: running times. 

p I 200 I 400 I 600 I 800 I 1000 



IP (sec/iter) 

OFO (100*sec/iter) 



0.14 


0.70 


2.15 


4.68 


0.91 


1.07 


1.33 


1.64 



9.46 
1.91 



Table 2. Comparing implementations of the ScHeDs: inte- 
rior point (IP) vs. optimal first-order (OFO) method. We 
report the time per iteration (in seconds) for varying p 
in the experiment described in Section 6.2 with T = 200, 
s* = 2, a — 0.1. Note that the iterations of the OFO are 
very cheap and their complexity increases linearly in p. 

mation of /3 , s* and a* for our procedure and for the 
Square-root Lasso are reported in Table 1 along with 
the standard deviations. For both procedures, the uni- 
versal choice of tuning parameter A = ■\/2 log(p) is 
used (after properly normalizing the columns of X) 
and a second step consisting in bias correction is ap- 
plied (c/. (Sun and Zhang, 2012) and the discussion in 
Section 4 on bias correction). Here, we did not use any 
group structure so the penalty is merely proportional 
to the £i-norm of /3. One can observe that the ScHeDs 
is competitive with the Square-root Lasso, especially 
for performing variable selection. Indeed, in all consid- 
ered settings the ScHeDs outperforms the Square-root 
Lasso in estimating s*. 

6.3. Application to the prediction of the 
temperature in Paris 

For experimental validation on a real-world dataset, 
we have used data on the daily temperature in 
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Figure 2. Top row: increments of temperatures (in Fahrenheit) from one day to the next observed in Paris in 2008. Middle 
row: predictions provided by our ScHeDs procedure; we observe that the sign is often predicted correctly. Bottom row: 
estimated noise level. 



Paris from 2003 to 2008. It was produced 

by the National Climatic Data Center (NCDC), 
(Asheville, NC, USA) and is publicly available at 
ftp: //f tp.ncdc .noaa.gov/pub/data/gsod/. Per- 
forming good predictions for these data is a challenging 
task since, as shown in Fig. 2, the observations look 
like white noise. The dataset contains the daily aver- 
age temperatures, as well as some other measurements 
like wind speed, maximal and minimal temperatures, 
etc. 

We selected as response variable yt the difference of 
temperatures between two successive days. The goal 
was to predict the temperature of the next daybased on 
historical data. We selected as covariates Xt the time 
t, the increments of temperature over past 7 days, the 
maximal intraday variation of the temperature over 
past 7 days and the wind speed of the day before. In- 
cluding the intercept, this resulted in a 17 dimensional 
vector Xf. Based on it, we created 136 groups of func- 
tions f, each group containing 16 elements. Thus, the 
dimension of <p* was 136 x 16 = 2176. We chose q — 11 
with functions r^ depending on time t only. The precise 
definitions of fj and r^ are presented below. 

To specify X, we need to define the functions fj gen- 
erating its columns. We denote by Ut the subvector of 
Xf obtained by removing the time t. Thus, Ut is a 16- 
dimensional vector. Using this vector Ut € M^^, we de- 
fine all the second-order monomes: Xi.i'i'^t) — ^t "t 
with i < i' . We look for fitting the unknown func- 
tion f* by a second-order polynomial in Ut with coef- 
ficients varying in time. To this end, we set tpi{t) = 1j 
Mt) = t^^(-'^~'^\ for £ = 2,3,4 and 



Once these functions Xi,i' ^^'^ V'f defined, we denote 
by fj the functions of the form tjj({t)xi,i'{ut)- In other 
terms, we compute the tensor product of these two 
sets of functions, which leads to a set of functions {f^} 
of cardinality 16 x 16 x 17/2 = 2176. These functions 
are split into 136 groups of 16 functions, each group 
defined by Gj,^' = {ipi{t) x Xi,i'iut) : ^ = 1, . . . , 16}. 

We defined R as a T x 11 matrix, each of its eleven 
columns was obtained by applying some function r^ 
to the covariate Xt for t = 1, . . . , T. The functions r^ 
were chosen as follows: ri{xt) — 1, r2{xt) = t, r3{xt) = 
l/{t + 2x 365)5 and 



re{xt) = 


= 1 + cos{2tt{1 - 


~3)f/365); 


£-=4,. 


•,7; 


reixt) = 


= 1 + cos(27r(£ - 


- 7)f/365); 


£ = 8,. 


.,11 



Mt) =cos(27r(£-4)t/365); 
Mt) =sin(27r(£-10)V365); 



£ = 5,..., 10; 
£= 11,...,16. 



Note that these definitions of X and R are somewhat 
arbitrary. Presumably, better results in terms of pre- 
diction would be achieved by combining this purely 
statistical approach with some expert advice. 

We used the temperatures from 2003 to 2007 for train- 
ing (2172 values) and those of 2008 (366 values) for 
testing. Applying our procedure allowed us to reduce 
the dimensionality of from 2176 to 26. The result 
of the prediction for the increments of temperatures 
in 2008 is depicted in Fig. 2. The most important 
point is that in 62% of the cases the sign of the in- 
crements is predicted correctly. It is also interesting 
to look at the estimated variance: it suggests that the 
oscillation of the temperature during the period be- 
tween May and July is significantly higher than in 
March, September and October. Interestingly, when 
we apply a Kolmogorov-Smirnov test to the residuals 
ytRt,:(i — Xt,:4> ioT t belonging to the testing set, the 
null hypothesis of Gaussianity is not rejected and the 
p value is 0.72. 
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7. Conclusion and outlook 

We have introduced a new procedure, the ScHeDs, 
that ahows us to simultaneously estimate the condi- 
tional mean and the conditional variance functions 
in the model of regression with heteroscedastic noise. 
The ScHeDs relies on minimizing a group-sparsity pro- 
moting norm under some constraints corresponding to 
suitably relaxed first-order conditions for maximum 
penalized likelihood estimation. We have proposed 
several implementations of the ScHeDs based on vari- 
ous algorithms of second-order cone programming. We 
have tested our procedure on synthetic and real world 
datasets and have observed that it is competitive with 
the state-of-the-art algorithms, while being applicable 
in a much more general framework. Theoretical guar- 
antees for this procedure have also been proved. 

In a future work, we intend to generalize this ap- 
proach to the case where the inverse of the conditional 
standard deviation belongs to a reproducing kernel 
Hilbert space, or admits a sparse linear representa- 
tion in a large, possibly over-complete, dictionary. The 
extension of our methodology to the case of nonover- 
lapping groups (Obozinski et al., 2011; Mairal et al., 
2011) and the substitution of the •^i/€2-norm penalty 
by more general £i/£q-norms in our framework are 
challenging avenues for future research. 
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8. Supplement to the paper: Learning Heteroscedastic Models by Convex 
Programming under Group Sparsity 

8.1. Proof of Theorem 3.2 

The fact that the feasible set is not empty fohows from the fact that it contains the minimizers of (6). This 
immediately follows from the first-order conditions and their relaxations. Indeed, for a minimizer {(j)°,a°) of 
(6), the first-order conditions take the following form: there exists v° E R^ such that for all k e [A'] and i E [q], 

^ -PL(0°, a°) = -X^G^ (diag(y)Ra° - X0°) + AfeX^^, J'''''tf'°| = 0, (19) 



^PL(0°, «°) = - Y.,^^ ^ + E,^r (y*^'-^"" ~ X,,cf>°)ytru ~ {v°YR,. = 0, (20) 

and v°Rt.:a° = for every t. It should be emphasized that relation (19) holds true only in the case where the 
solution satisfies min^ \X ■,^Gk'P° Gk\'^ t^ Oj otherwise one has to replace it by the condition stating that the null 
vector belongs to the subdifferential. Since this does not alter the proof, we prefer to proceed as if everything 
was differentiable. 

On the one hand, {cj)°,a°) satisfies (19) if and only if nGfc(diag(Y)Ra° — X0°) = AfeX:_Gfc 00^/1-^:, Cfc^G^h 
with Uck = ^:.Gfc(X^(5 X:_Gfc)~''X^Q being the orthogonal projector onto the range of X:^^^. in MF . Taking the 
norm of both sides in the last equation, we get |nGfc(diag(Y)Ra° — X0°)| < A^. This tells us that {4>°,a.°) 
satisfy (7). On the other hand, since the minimum of (6) is finite, one easily checks that Rt^.a." ^ and, 
therefore, f° = 0. Replacing in (20) u" by zero and setting v^ = l/Rt^-o" we get that {<p° ,a.°,v°) satisfies 
(8), (9). This proves that the set of feasible solutions of the optimization problem defined in the ScHeDs is not 
empty. 

Let us show that one can compute the ScHeDs (0, a) by solving an SOCP. More precisely, we show that if 
{(p, a, u, 9) e ]RP X M'J X R^ x R^ is a solution to the following problem of second-order cone programming: 

EK 
XkUk (21) 

k—l 

subject to (7) and 

IX.G.'/'G.L < "fe' Vke[K], (22) 

K^v < R^diag(r)(diag(r)Ra - X</)); (23) 

\[vt;Rt,:a;V2]\^<vt+Rt,a; yteT, (24) 

then {(f>, a,v) is a solution to the optimization problem stated in Definition 3.1. This claim readily follows from 
the fact that the constraint [w^; i?t^:Q;; v2] „<«*-!- Rt^.ct can be equivalently written as vt{Rt,:Cy) > 1 and 
vt + Rt,:Ct > for every t. This yields vt > and Rt,.OL > 1/vt for every t. Furthermore, it is clear that if 
(0, S, u, v) is a solution to the aforementioned optimization problem, then all the inequalities in (22) are indeed 
equalities. This completes the proof. 

8.2. Proof of Theorem 5.1 

To prove Theorem 5.1, we first introduce a feasible pair (0, a), in the sense formulated in Lemma 8.1. 

Lemma 8.1. Consider the model (10). Let z = 1 + 2C4\/ — "s^'?/^-' with some e > and assume that z < 2. 

Then with probability at least 1 — 2£, the triplet {(p,a,v) = {zc/)* , za* , {-^ — — r)t=i,...,T) satisfies constraints (7), 

(8) and (9). Moreover, the group- sparsity pattern {fc : IcpQ 7^ ^l "^Z </" coincides with that of <p* , that is with 
JC*. 

The proof of this lemma can be found in Section 8.3. 
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Set A. — 4> — 4>. On an event of probability at least 1 — 2e, (0, a.) is a feasible solution of the optimization 
problem of the ScHeDs whereas (</), a) is an optimal solution, therefore 

K K K K 

k=l k=l fc=l fc=l 

= ^ Afe|X:,G.AG,|2+ Yl ^ki\^--,G,(l>G,\2~\'^--,G,4>G,\2) 
keK* k<£K' 

<2^ Afe|X:,G,AG4. (25) 

fee/c* 

This readily implies that 

/ . ^k\^:,Gk^Gk\2^ / . ^fc|X:,GfcAGj^. 



Applying GRE(k, s) assumption and the Cauchy-Schwarz inequality, we get 

nx ""^-^ 

k=l kelC* keK* 

It is clear that 



^A..|X,g.Ag4 <2( Y. ^t)''\ E IX.c.AG.ig''' < 1{Y.,,^A)''' |XA|, (26) 



|XA|2 = A^X^(X0 - X^) 

= A^X^(X$ - diag(RS)r) + A^X^(diag(Ra)r - X^) + A^X^diag(y)R(a - 5). 

In addition, using the relation XA = X)/c=i -^^Gt Ag^ = ^j,^^ nGfcX:_Gfe Ag^ and the fact that both {(f), a) 
and (</>, (5) satisfy constraint (7), we have 

K K 

|XA|^ < Y, A5^X^G,nG,(X0 - diag(Ra)y) + ^ A5^X^G,nG,(diag(R5)F - X0) 

k=l k=l 

+ A^X^diag(r)R(S - a) 

K 

<2^Afe|X^^G.AG,|2 + |XA|2.|DyR(a-5)|2. (27) 



k=l 

,1/2 



Therefore, from (26), |XA|2 < KEfcec* ^fe) + |DyR(a - S)|2 and we easily get 

|X(0 - r)\2 < |X(0 - 0*)|2 + |XA|2 <{z- 1)|X0*|2 + -(E,.r. ^')'^' + |DyR(S - 5)|2. 



where we have used the following notation: for any vector u, we denote by D^ the diagonal matrix diag(v). 
To complete the proof, it suffices to replace z and Xk by their expressions and to use the inequality 

|DyR(S - a)\2 < |DyR(S - a*)|2 + (z - l)|DyRa*|2 

< |DyR(S - a*)|2 + (z - 1)|X0* + ^|2 

< |DvR(5 - a*)\2 + {z- 1)(|X0*|2 + l^b). 

8.3. Proof of Lemma 8.1 

For all e G (0, 1), consider the random event Be = Cfe^i {^^ i^ ^l i)^ where 



^h = [Eter -^r^^'^--"^*^' - -v/2C2riog(2g/£)| , 
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^M = {E,,r ^7^(^* -^^- -VCiTlog(2q/£)| . 

Using standard tail estimates for the Gaussian and the x^ distributions, in conjunction with the union bound, 
one easily checks that P{Be) > 1 — e. In what follows, we show that on the event B^, (0, S, v) satisfies constraints 

(7)-(9). 

Constraints (9) are satisfied (with equality) by definition of v. To check that (8) is satisfied as well, we should 
verify that ior all i — 1, . . . ,q, 



z2 Z^teT Rt,a* - -^teT Rt.a* '-^ ^* Z^teT Rt .a*^' 



On the event B^^ the right-hand side of the last inequality can be lower bounded as follows: 



i:.cr id?^-*'^' + S.er Td?«' ^ ^'^ + v^)v/55T;i(V5 + X],^ 



'KT Rt..a* "^-^ ■"■ ^L^ter Rt.-a*^' - ^^ ^ '^ o\ ■:./ / ^l^tgr fit^^a* 

Thus, on Bi; if for all ^ = I, . . . , q 



E,^r R^ - i^2 + V^W2T\og{2q/e) (28) 



then constraint (9) is fulfilled by {cf),a.,v). Inequality (28) is valid since for any z>l 

^ 1 O' 



'^^teT Rt a* z \ z I ^-^ 



eT Rf .oi* z V z h^-^teT Rf .OL* 



3 



and ^TCs > (y/C^ + y/2C^)^2Tlog{2q/e) when z = I + 2C4 J^^SS^sM < 2. 



T 

On the other hand, since z < 2, a sufficient condition implying that the pair (</>, a) satisfies (7) is 

2\UG,^\^<Xk, Vfce {!,..., A'}. (29) 

Recall that rk denotes the rank of Tlck ■ Let TZe be the random event of probability at least 1 — e defined as 
follows 

K K 

7^e= {^ne,k= fl [\n.G,i\l < rfe + 2Vrfclog(i^/£) + 21og(A7e)} . 
fe=i k=i 

To prove that P{Tle) > I — e, we use the fact that 11(5^.^12 is drawn from the Xr^. distribution. Using well-known 
tail bounds for the x'^ distribution, we get P{TZ'^ k) — W' Then, we conclude by the union bound. 

Since we chose 

2irk + 2^rk\og{K/e) + 2\og{K/e)f/^^\k, 

on the event TZe inequality (29) is satisfied by {4>, a). 

Finally, the triplet {4>,a,v) fulfills constraints (7)-(9) on the event B^ TZ^, which is of a probability at least 
l-2e. 

8.4. Proof of Theorem 5.2 

We start by noting that, the ScHeDs (0, a) satisfies V^ G {1, . . . ,q}, the relation 

E.,r ^ - E.,r (^'^^'^'^ ^ xa)mru. (30) 
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First, for the ScHeDs, all the inequalities in (9) are equalities. Indeed, wt's are only involved in (8) and (9) and if 
we decrease one vt to achieve an equality in (9), the left-hand side of (8) will decrease as well and the constraint 
will stay inviolated. Thus, setting vt — l/Rt^-.a, we get from (8) 



E 



ru 



teT Rt a. 



^E,.r(^*^*'^"-^*-^^) 



Vtru, 



V^e{l,...,g}. 



(31) 



To be convinced that Eq. (30) is true, let us consider for simplicity the one dimensional case 9=1. If inequality 
(31) was strict, for some w G (0, 1), the pair {wcf), wa) would also satisfy all the constraints of the ScHeDs and 
the corresponding penalty term would be smaller than that of (0, a). This is impossible since <f) is an optimal 
solution. Thus we get 



Using the identity (Rt.-.a)^^ — (Rt.-a*)^^ + {Rt^-.OLRt^-.a.*)^^ Rt.:{a.* — a), we get 

[E 7S-^)k^«?:«.:] ("* - ") — Y,^M. + R-D.(D.RS - X$) 



(32) 



t&r 



ter' 



R^D-^.It + R^D|.R(S - a*) - R^DyX(0 - 0*) 
+ R^Dy (DvRa* X0*). 



In view of the identities DyRa* - X0* = ^ and Dy = Dp^^.(Dx</,* + D^), Eq. (33) yields^ 



R' 



D^ + n^l^-D^l R(a* - S) = R'B^Ui' It) R^DyX(0 - 0*) + R^D^l^M^^.t 



(33) 



(34) 



As a consequence, denoting by M the Moore-Penrose pseudo- inverse of the matrix R^ [Dy + D^^^.Dj^gJR, 

R(a* - S) = RMRT (b^I, ie - It) - DyX(0 - cj,*) + D^i^.Dx,/,.^') • (35) 

Multiplying both sides by Dy and taking the Euclidean norm, we get 



|DyR(a* - a)!^ < |DyRMR^(DR^.(^2 - It) + D^^.Dx,/,*!) [ + |DyRMR^DyX(0* - 0) 



(36) 



At this stage of the proof, the conceptual part is finished and we enter into the technical part. At a heuristic 
level, the first norm in the right-hand side of (36) is bounded in probability while the second norm is bounded 
from above by (1 — c)|X(0* ^ 0)1, for some constant c e (0, 1). Let us first state these results formally, by 
postponing their proof to the next subsection, and to finalize the proof of the theorem. 

Lemma 8.2. Let q and T he two integers such that 1 < q < T and let e e (0, 1/3) he some constant. Assume 
that for some constant Di > 1 the inequality maxt^j- j^'^^, < Di holds true. Then, on an event of prohahility 
at least 1 — 3e, the following inequalities are true^ : 



||Ml/2RTDy||| < 1^ 

M1/2rT(Dj,^,(^2 - It) +D^^.Dx0.^)|^ < lO^g^i log(2T/£)log(2g/£), 
IDvRMR^Dvl < 1-^^- — <1- "^ 



2Z?i (|X0* 1^ + IIIL) + 1 " Z?i (2|X0* 1^ + 5 log(2r/£)) 
In view of these bounds, we get that on an event of probabihty at least 1 — 3^, 

\ \'\r ( jj^ ^\ I 

5i(2|X0*|^ + 51og(2T/£));'l '^^^ 



DyR(a* - S)| < 10V9i?ilog(2r/e)log(2(7/£) + 1 - 



(37) 
(38) 

(39) 
(40) 



We denote by ^ the vector {^t )teT. 
^Here and in the sequel, the spectral norm of a matrix A is denoted by |||A|||. 
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Combining this inequality with Theorem 5.1 and using the inequahty 2|X^*|2 + |^|2 < v2^(2|X</)*|oo + |^|oo)j 
we get that the following inequalities are satisfied with probability > 1 — 5s: 



X{4,-4,%<2C4D^{2\Xcf>*\l, + 5log{2T/e))V2\og{2q/e){2\X<t>*\^ + \^U 



-2s* 

K 



1/2 



3K* log{K/e) D^ (2|X</,* |^ + 5 log(2r/£)) 



+ lODi (2|X(/.* IL + 5 log(2T/e)) ^JqD, \og{2T/e) \og{2q/e) 
< 4CiD^{2\Xct>*\l + 5log{2T/e)f'^2logi2q/s) 



8Di 



{2\Xcf>*\l + 5 log(2r/£)) (2s* + 3K* \og{K/e)) 



1/2 



+ lOD, {2\Xcjy*\l + 5 log(2T/e)) ^J qD, \og{2T/e) \og{2q/e) 
Using the notation Dt.e = i5i(2|X0*|^ + 51og(2r/£)), we obtain 



|X(0 - </.*)L < ^CiDi/t^2\og{2q/e 



8Dt, 



(2s* + 3K* \og{K/e)) 



1/2 



+ 10DT,eyqDi log(2T/e) log(29/e). 
To further simplify the last term, we use the inequalities: 



10DT,e^qDi\og{2T/e)log{2q/e) - DT,eVlO^/5Dilog{2T/e)^/2q\og{2q/e) 

< 4i?^2V2glog(2g/£). 
Combining this with (42) yields (17). 
To prove (18), we use once again (35) to infer that 

|R(a* - a)!^ < |RMR^(Dai„4^2 _ ^^^^ ^ D^i„.Dx</>-$) [ + |RMR"rDyX((/.* ~ 0)|^ 
< |||RMV2|||(^|mV2rT(^d^i„4^2 - It) + Dj,^.Dx<^*|) |^ + |||MV2rTd^||| 
In view of Lemma 8.2, this leads to 



|R(a* - S)!^ < |||RMi/2|||(^10Vg^ilog(2T/£)log(2g/e) + X(0* - 0) 

with probability at least 1 — 5e. Using the bound in (17), we get 

|R(a* - a)|, < IIIRMi/2 III (4(^4 + 2)D^J^^^2qlogi2q/e) + ^^^^2s* + 3K*\ogiK/e 

In view of the inequality^ 

(RMi/2)(RMi/2)T = R[rT(D|. + Dr^^.D-1)R]+R^ 

dR[R^(D^i„.D^l,)R]+R^ 

^ {ma4Rt,a* ■ fit,:S])R[R^R]+R^ 



ter ' 



we get 



|||RMl/2|||2 ^ |||(RMl/2)(RMl/2)T||| 

<Di\Ra*f •IRrR^Rl+R^I 

— I I oo '" L J III 

< Di\Ra*\^ , 

— -^1 loo' 

where the last inequahty fohows from the fact that R[R^R] R^ is an orthogonal projector. 



(41) 



(42) 



X(0* - 0) 



(43) 



(44) 



''We use the notation A >; B and B X A for indicating that the matrix A — B is positive semi-definite. For any matrix 
A, we denote by A+ its Moore-Penrose pseudoinverse. 
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8.5. Proof of Lemma 8.2 

We start by presenting a proof of (38). We have 

< |||DyRMi/2||| . |m1/2rT(d^i^^(^2 _ ij,) + D^i„.Dx</..C)|2 



We remark that 



and that 



M+ =R^[D|,+D-^.D-l]R^R^D|,R =^ |||DyRMi/2||| < 1. 



1 {X,,4,-/R,,a'y 

which imphes that 

|m1/2rTD^^.Dx<^*||; - |^D^^.Dx<^*RMR^Dx0.D^^4 

< fmax— ,, ^5*",t,^ ,„ ^, ^$^nig, (46) 

where Hi = Dj^^.Dx</)*R(R^Dx</,»Dj^^,R) R^Dx0*Dj^^. is the orthogonal projection on the hnear sub- 
space of M"^ spanned by the columns of the matrix Dj^^,Dx(/)*R- By the Cochran theorem, the random variable 
rji = ^ Tli£, is distributed according to the xi distribution. 

Using similar arguments based on matrix inequalities, one checks that 

^(?-yl4 y(i'-im(;'-i) , (47) 

where 112 — Dj^^.R(R^D^^,R) R^D^^, is the orthogonal projection on the linear subspace of R"^ spanned 
by the columns of the matrix D^^*R. 

To further simplify (46), one can remark that under the condition Rt,:a < DiRt,-a* , it holds 



{Rt,a*)^yf + [Rt-a* /Rt-a) " {Xt,(f>* + ^t)^ + D^ 



— <l + DiCt- (48) 



These bounds, combined with (45), yield 

DyRMRT(Dj,^.(|2 _ 1^) + Dj,^.Dx<^*^) ^ < v/(l + 5i|^|^)r;i + /S^- (49) 



2 



One can also notice that Tl2 is a projector on a subspace of dimension at most equal to q, therefore one can 
write II2 — J2'i=i ''^ ^'^J f"^^ some unit vectors v^ G R^. This implies that 



m^i2 l^^(^' - 1^)1' < 9 max \j2veACf - 1) 

^ — ' £— l,...,g I ^ — ' 



ter 



Hence, large deviations of rji and 772 can be controlled using standard tail bounds; see, for instance, 
Laurent and Massart (2000, Lemma 1). This implies that with probability at least 1 — 2e, 



DyRMR^(D^i„.(|2 _ 1^) +D^^.Dx<^*|) < Jl + Di\^\UV^+ V21og(g/e)) + ^JqDl 4\og{2q/e 
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To conclude, it suffices to remark that P(|^|oo < \/21og(2r/e)) > 1 — £. This imphes that 
DyRMRT(Dj,i„. (D| - It)1t + Dj^^.Dx,/,*^) 



< 2jD^\og{2T/e){^+ ^2\og{q/e)) + v/?^! 41og(2<z/e 



< Ad2qD^ log(2T/e) log(g/e) + A^ qD^ \og{2q/e 



<lQ^qDi\og{2T/e)\og{2q/e). 

This completes the proof of the first claim of the lemma. 
Let us now switch to a proof of (39). It is clear that 

IDyRMR^Dvi = |||Mi/2RTDy|||2 

<|||MV2R^(D^+D^i^D^i,.)^/^linil(D|.+D^i,D^L.)-^/^Dv|| 



ter yf + {Rt,:a-* ■ Rt,:^)-'^ ' 
Using the fact that Rt^-o. < DiRt^a* for every t, we obtain 

yl{Rt,cy*? 



|||DyRMR"^D^ 



1 



1 — min ■ 



(50) 



= 1 — min ^^ . (51) 



2|^|^ and to apply the well-known bound on the tails of the Gaussian distribution. 



To complete the proof of the lemma, it suffices to remark that {Xt^:4>* + ^t) < 2{Xt.:4>*) + 2^j < 2\^4>* ^^ 

|2 



